[SDBM-2746] Collect standard system-table metrics per node in single endpoint mode#24266
Open
sangeetashivaji wants to merge 9 commits into
Open
[SDBM-2746] Collect standard system-table metrics per node in single endpoint mode#24266sangeetashivaji wants to merge 9 commits into
sangeetashivaji wants to merge 9 commits into
Conversation
In single endpoint mode the agent reaches a multi-node ClickHouse cluster through one load balancer, but the standard metric queries read per-node system tables directly (bare system.events/metrics/errors). Consecutive scrapes land on different nodes, so cumulative counters appear to jump up and down and the backend reports phantom monotonic_count spikes such as the false clickhouse.query.failed.count in SDBM-2746. Route the bulk-match metric queries and system.errors through clusterAllReplicas() and tag each row with hostName() when single_endpoint_mode is enabled, giving each node its own metric series (counters increment monotonically per node; gauges stop flapping). Direct connections are unchanged. system.parts/replicas/dictionaries use GROUP BY and are left for a follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
🎉 All green!🧪 All tests passed 🎯 Code Coverage (details) 🔗 Commit SHA: 3054c6f | Docs | Datadog PR Page | Give us feedback! |
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the manual search + start()/end() slicing with a single SYSTEM_TABLE_FROM_CLAUSE.subn() pass (count=1). The regex now consumes the whitespace before FROM so the per-node projection splices in without a gap before the comma, keeping the rewritten query byte-for-byte identical. Output and behavior are unchanged; all unit tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ruff 0.11.10 reformats the parenthesized assert condition back to the trailing-message form. Apply it so `ddev test --fmt` / CI lint passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Instead of rewriting system-table metric queries with a regex at runtime, define static cluster-aware query variants and select them in get_queries() based on single_endpoint_mode. - Add cluster_aware_query() helper in utils.py that reuses the base query's columns by reference (no duplication) and appends the clickhouse_node tag. - Define legacy variants in queries.py and advanced variants in advanced_queries (JSON-backed via load_match_query + SystemErrorsClusterAware). - Remove import re, SYSTEM_TABLE_FROM_CLAUSE, and make_cluster_aware(). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cluster_aware_query now takes only the base query dict and derives the SELECT list and table from base['query'], so call sites no longer restate column lists or table names. load_match_query reverts to its original signature; __getattr__ wraps the loaded query for ClusterAware names. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop the static *ClusterAware definitions and the __getattr__ suffix handling; get_queries wraps each base query via cluster_aware_query() when single_endpoint_mode is on. queries.py and advanced_queries revert to base; the change now touches only utils.py, clickhouse.py, and the tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
Validation ReportAll 21 validations passed. Show details
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Makes the standard (non-DBM) ClickHouse system-table metric collection cluster-aware when
single_endpoint_modeis enabled. The bulk-match metric queries (system.events,system.metrics,system.asynchronous_metrics) andsystem.errorsare now routed throughclusterAllReplicas('default', system.<table>)and tagged withhostName() AS clickhouse_node, so each cluster node gets its own metric series.system.parts,system.replicas, andsystem.dictionariesuseGROUP BYand are intentionally left unchanged here (follow-up).Motivation
On a multi-node ClickHouse Cloud cluster reached through a single load-balancer endpoint, the standard metric queries read per-node
systemtables directly.Review checklist (to be filled by reviewers)
qa/requiredif this PR needs QA validation, orqa/skip-qaif it does not. Exactly one of the two is required.backport/<branch-name>label to the PR and it will automatically open a backport PR once this one is merged